AntiSplodge: a neural-network-based RNA-profile deconvolution pipeline designed for spatial transcriptomics

Abstract With the current surge of spatial transcriptomics (ST) studies, researchers are exploring the deep interactive cell-play directly in tissues, in situ. However, with the current technologies, measurements consist of mRNA transcript profiles of mixed origin. Recently, applications have been proposed to tackle the deconvolution process, to gain knowledge about which cell types (SC) are found within. This is usually done by incorporating metrics from single-cell (SC) RNA, from similar tissues. Yet, most existing tools are cumbersome, and we found them hard to integrate and properly utilize. Therefore, we present AntiSplodge, a simple feed-forward neural-network-based pipeline designed to effective deconvolute ST profiles by utilizing synthetic ST profiles derived from real-life SC datasets. AntiSplodge is designed to be easy, fast and intuitive while still being lightweight. To demonstrate AntiSplodge, we deconvolute the human heart and verify correctness across time points. We further deconvolute the mouse brain, where spot patterns correctly follow that of the underlying tissue. In particular, for the hippocampus from where the cells originate. Furthermore, AntiSplodge demonstrates top of the line performance when compared to current state-of-the-art tools. Software availability: https://github.com/HealthML/AntiSplodge/.


Supplementary Section 1: Sampling properties
The main strength of AntiSplodge comes from the fact that we sample new synthetic spatial transcriptomics profiles from real scRNA cells. First, theoretical profiles are sampled, examples of these are shown in Supplementary Figure  2. Next, these profiles are converted to synthetic spatial transcriptomics profiles by picking a random number of cell types equal to the count in the profile, where cell type index corresponds to the respective index in the profile. The sampler will start by having an equal chance of each cell type (index), and then go towards the more extreme profiles, where some indices will have higher probabilities of being selected. These probabilities are shuffled for each profile generated, so that which cell type has the highest probability is random for each profile. In this particular example, we have two cell types (Nc=2), and each profile has a cell density of 2 (the number of cell that are used to generate the profile CD min, CD max = 2). A total of 1000 profiles is generated (M=1000). If we for example had [M=1000, CD min=1, CD max=2] we would generate 1000 profiles with CD=1, and 1000 profiles with CD=2, totalling the number of profiles to 2000.
If we look at the profiles generated (in Supplementary Figure 1), we see that profile 1 (index 0) has 1 cell of cell type 0 and 1 cell of cell type 1. This is true for profile 1, 2, 4, 5. For profile 3, we see that cell type 0 has 2 cell for this profile, this is due to the fact that the probability selected cell type 0 two times (remember in the beginning each type has equal probabilities, so this is possible). If we look at the last 5 profiles (indices: 995, 996, 997, 998, 999), we see that 995, 996, 998, and 999 has 2 of the same cell types (because we are now sampling with a higher bias towards one cell), but it is still possible to sample equal profiles like in profile index: 997.
If we turn our attention to the distributions in Supplementary Figure 2. In (a) we see how the sampler does a good job of distributing cell types evenly across the possible profiles that it can generate. We know that in this case, whenever 1 cell type has been contributed with 1 in a profile, the other must be contributed with a 1 to satisfy the CD of 2. There is a total of 1000 profiles being generated and approximately 50% (500) has a 1 for each cell type. The remaining profiles will have a 2 in one of them and 0 in the rest (to satisfy CD=2). Each has approximately 25% of 2's, meaning a total of 50% 2, which leaves 50% of 0's, meaning we have equal distribution of all possible distributions of profiles.
In (c) and (d) we extended the problem to 4 cell types (with M=2000). In (c) all profiles have CD of 4, while in (d) we have 2000 profiles of CD 1, CD 2, CD 3, CD 4. Like in (a) we the distributions are actually equal, but now shared across 4 cell types. In (c), whenever we see a profile with 4 of one cell type, all the (three) others must now have a 0 in their counts. Whenever one cell type has 3, one of the others must have a 1 and the remaining 2 must have a 0. While for profiles with a 2, one of the others can have a 2, or two of the others can each have a 1, with the remaining having a 0. This gives this ladder distribution which can might not seem equal, but it is. For d we see the same distributions with more profiles in the lower counts, as a result of the lower CDs.
In (e) and (f ) we see the generation with CD = 10, but for (e) we generate 1000 profiles and (f ) we generate 100000 profiles. Here the overall distributions stays the same regardless of the number of profiles generated, which is what we would like to see.
In (b) we extend the number of cell types to 9 (with CD=10 and M=100000) which is a more realistic case. Again we see that the distributions are the same across cell types, but with higher concentration of 0s, as a 9 in one type will generate 8 zeroes, one in each of the rest of the cell types. A 8 will generate a 1 in one of the others and zero in the rest of the 7, etc.
For all of the distributions, each cell type is represented an equal number of times (Supplementary Figure 3), regardless of their proportionality in the scRNA dataset that will be used. If we look at Supplementary Figure 3(a), we see that each cell type has approximately 1000 cell counts, 1006 and 994, for 0 and 1 ,respectively. The total number of cell counts can be computed with the formular: while the counts per cell type is then approximately divided among the cell types: which is then divided between the four cell types so that each has approximately 5000 counts.
Conclusion: The distributions grasp all the possible cases of counts, which is in particular nice when working with neural networks in an unbiased way. We recommend using a sufficient high M but with a CD equal (both min and max the same) to the suspected number of CDs in the STs being deconvoluted. If you suspect that many of the spots will have a tendency towards a lower CD but with a few high CDs we recommend generating profiles in a range from the low suspected number of CDs towards the expected high CD. For example in the range of [1,2,3,4,5,6,7,8,9,10].

Supplementary Section 2: error function and run-time experiments Initial experiments
Because of the versatility of AntiSplodge, you can supply it with any error function that is compatibly with Pytorch (including custom functions). Therefore, we tested AntiSplodge using the standard available error functions that makes sense for the deconvolution problem. For this we generated synthetic datasets with various sizes but still small compared to what you should do in an actual anaylsis. Below are the overview of datasets (with and without dropout on the first layer), where the numbers in the brackets corresponds to [#training profiles, #validation profiles, #test profiles]. The indices corresponds to the respective columns in Supplementary Table 1 and Supplementary Table 2. For all experiments, we ran 5 warm restarts with a patience counter of 50, using the default AntiSplodge optimizer (Adam with default parameters), profiles contained 1,562 marker-genes found via logistic regression (one cell type vs. all), and synthetic profiles were generated based on cell samples from the Heart Cell Atlas (https://www.heartcellatlas.org/), using the global dataset without normalization. The train, validation, test-split is 90%, 5%, 5%, respectively.

Extended experiments
This time, we try with larger datasets for L1 and SL1, in the range of 200.000 to 500.000 training samples, only increasing the number of synthethic training profiles. We run these experiments with 100 warm-restarts and a patience threshold of 3.   Based on the results in Supplementary Table 3, it is clear that more training data, means better models (as expected), even with the outlier in column 6 (we ran each experiment once, repeating each experiment multiple times would give us more stable results). It also seems that it is beneficial to use L1 rather than SL1, as L1 on average is approximately 1% point lower than SL1 in terms of JSD.

Realistic-sized experiments
We now try with datasets that are more realistic in terms of how an actual study should proceed, again for L1 and SL1. Only increasing the number of synthethic training profiles. This time in the range of 1 million to 4 million samples. (The synthetic datasets (and their generated profiles), are the same for 1 and 5, 2 and 6 etc.) Looking at the results for realistic-sized experiments (Supplementary Table 5), we see that overall L1 seems to be the best error function (at least for this dataset), for the 4 million synthetic training profiles dataset, it is approximately 1.5%-points lower than SL1. The difference in results for using drop-out vs. not using drop-out seems insignificant, however, the run-time is considerable larger for the drop-out version (Supplementary  because the number of epochs increases as improvements usually comes in smaller error reductions, because of the missing features at random in the drop-out version. We run these experiments with 100 warm-restarts and a patience threshold of 3.
Conclusion: Overall, L1 loss seems to be the best choice of error function, dropout takes approximately double the number of solutions to reach the same plateau as the non-dropout version, so unless it is needed, use the non-dropout version as using a validation check already solves the over-fitting. Also, use as many training samples as your system can handle, up to a reasonable number. You might think, why doesn't we reach the same JSD as in the comparison figure in the main text, this is because the comparison was made with a quality controlled and preprocessed version of the dataset, while these tests use the dataset as is.

Supplementary Section 3: Comparison Cell population
In Supplementary

Marker genes
Below are the marker genes used (in python format, N=1389).
Supplementary Section 4: Developmental human heart analysis Cell population In Supplementary Table 8 is the number of each cell type after preprocessing. Note, this is a rather small scRNA dataset (N=3717 cells), but we can still sample as many synthetic spatial transcriptomics profiles as we wish

Predicted cell type proportions
Based on the trained model (one for all samples, based on the accompanied scRNA) for Application 1, these are the predicted proportions for each cell type, shown for each tissue sample (zoom in to see the numbers).   Table 9: Cell types after preprocessing from the Allen's Mouse Brain Atlas (https://portal.brain -map.org/atlases-and-data/rnaseq/mouse-whole-cortex-and-hippocampus-smart-seq)

Predicted cell type proportions
Based on the trained model (one for all samples, based on the Allen's Mouse Brain Atlas) for Application 2, these are the predicted proportions for each cell type, shown for each tissue sample, more red and means higher proportions (zoom in to see more details).